Source Link: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
Summary information about the data: Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in New York City, NY, for 2019. The data file includes the information needed to learn more about hosts and geographical availability, along with the metrics needed to make predictions and draw conclusions. This public dataset originates from Airbnb, and the original source can be found on the Inside Airbnb website (see References).
This dataset has 48895 observations of 16 variables.
The variables in the dataset are:
id : Listing ID
name : Listing Name
host_id : Host ID
host_name : Name of the Host
neighbourhood_group : Location (borough)
neighbourhood : Area (neighbourhood within the borough)
latitude : Latitude Coordinates
longitude : Longitude Coordinates
room_type : Listing Space Type
price : Price in Dollars
minimum_nights : Minimum number of nights
number_of_reviews : Number of Reviews
last_review : Latest Review
reviews_per_month : Number of Reviews per Month
calculated_host_listings_count : Number of listings per host
availability_365 : Number of days per year when the listing is available for booking
The variable types in the dataset are:
id : Integer
name : Factor
host_id : Integer
host_name : Factor
neighbourhood_group : Factor
neighbourhood : Factor
latitude : Numeric
longitude : Numeric
room_type : Factor
price : Integer
minimum_nights : Integer
number_of_reviews : Integer
last_review : Factor
reviews_per_month : Numeric
calculated_host_listings_count : Integer
availability_365 : Integer
Loading Packages
library(tidyverse)
library(here)
library(ggplot2)
library(gridExtra)
To load the Airbnb data, we use the here() function to build the file path and read_csv() to read the file
airbnb <- read_csv(here("Data", "AB_NYC_2019.csv"))
## Parsed with column specification:
## cols(
## id = col_double(),
## name = col_character(),
## host_id = col_double(),
## host_name = col_character(),
## neighbourhood_group = col_character(),
## neighbourhood = col_character(),
## latitude = col_double(),
## longitude = col_double(),
## room_type = col_character(),
## price = col_double(),
## minimum_nights = col_double(),
## number_of_reviews = col_double(),
## last_review = col_date(format = ""),
## reviews_per_month = col_double(),
## calculated_host_listings_count = col_double(),
## availability_365 = col_double()
## )
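The column types printed above (double, character, date) differ from the Integer and Factor types listed earlier, because read_csv() guesses the types from the file contents. A minimal sketch, assuming one wanted the listed types directly at load time, is to pass a col_types specification. The three-row extract, the temporary file, and the name sample_listings are made up for illustration and are not part of the original analysis.

```r
library(readr)

# Hypothetical three-row extract written to a temporary file, so the
# example does not depend on AB_NYC_2019.csv being present.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,neighbourhood_group,room_type,price,last_review",
  "2539,Brooklyn,Private room,149,2018-10-19",
  "2595,Manhattan,Entire home/apt,225,2019-05-21",
  "3647,Manhattan,Private room,150,"
), tmp)

# Declare the intended type of each column instead of relying on guessing.
sample_listings <- read_csv(
  tmp,
  col_types = cols(
    id                  = col_integer(),
    neighbourhood_group = col_factor(),
    room_type           = col_factor(),
    price               = col_integer(),
    last_review         = col_date(format = "%Y-%m-%d")
  )
)

str(sample_listings)
```

With this approach the later factor() conversion would not be needed, at the cost of spelling out every column up front.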
Printing the Airbnb dataset
airbnb
Structure and features
str(airbnb)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 48895 obs. of 16 variables:
## $ id : num 2539 2595 3647 3831 5022 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : num 2787 2845 4632 4869 7192 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : num 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : num 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : num 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : Date, format: "2018-10-19" "2019-05-21" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: num 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : num 365 355 365 194 0 129 0 220 0 188 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. name = col_character(),
## .. host_id = col_double(),
## .. host_name = col_character(),
## .. neighbourhood_group = col_character(),
## .. neighbourhood = col_character(),
## .. latitude = col_double(),
## .. longitude = col_double(),
## .. room_type = col_character(),
## .. price = col_double(),
## .. minimum_nights = col_double(),
## .. number_of_reviews = col_double(),
## .. last_review = col_date(format = ""),
## .. reviews_per_month = col_double(),
## .. calculated_host_listings_count = col_double(),
## .. availability_365 = col_double()
## .. )
Brief Summary of dataset
summary(airbnb)
## id name host_id host_name
## Min. : 2539 Length:48895 Min. : 2438 Length:48895
## 1st Qu.: 9471945 Class :character 1st Qu.: 7822033 Class :character
## Median :19677284 Mode :character Median : 30793816 Mode :character
## Mean :19017143 Mean : 67620011
## 3rd Qu.:29152178 3rd Qu.:107434423
## Max. :36487245 Max. :274321313
##
## neighbourhood_group neighbourhood latitude longitude
## Length:48895 Length:48895 Min. :40.50 Min. :-74.24
## Class :character Class :character 1st Qu.:40.69 1st Qu.:-73.98
## Mode :character Mode :character Median :40.72 Median :-73.96
## Mean :40.73 Mean :-73.95
## 3rd Qu.:40.76 3rd Qu.:-73.94
## Max. :40.91 Max. :-73.71
##
## room_type price minimum_nights number_of_reviews
## Length:48895 Min. : 0.0 Min. : 1.00 Min. : 0.00
## Class :character 1st Qu.: 69.0 1st Qu.: 1.00 1st Qu.: 1.00
## Mode :character Median : 106.0 Median : 3.00 Median : 5.00
## Mean : 152.7 Mean : 7.03 Mean : 23.27
## 3rd Qu.: 175.0 3rd Qu.: 5.00 3rd Qu.: 24.00
## Max. :10000.0 Max. :1250.00 Max. :629.00
##
## last_review reviews_per_month calculated_host_listings_count
## Min. :2011-03-28 Min. : 0.010 Min. : 1.000
## 1st Qu.:2018-07-08 1st Qu.: 0.190 1st Qu.: 1.000
## Median :2019-05-19 Median : 0.720 Median : 1.000
## Mean :2018-10-04 Mean : 1.373 Mean : 7.144
## 3rd Qu.:2019-06-23 3rd Qu.: 2.020 3rd Qu.: 2.000
## Max. :2019-07-08 Max. :58.500 Max. :327.000
## NA's :10052 NA's :10052
## availability_365
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 45.0
## Mean :112.8
## 3rd Qu.:227.0
## Max. :365.0
##
— The number of missing values is the same (10052) for “last_review” and “reviews_per_month”, which makes sense: a listing that has never been reviewed presumably has neither a last review date nor a monthly review rate.
— We conclude that the missing values do not need to be treated manually.
— The following columns can be omitted because they carry no information useful for our analysis: id, name, host_id, host_name, last_review, and neighbourhood.
— We will also convert room_type and neighbourhood_group to factors. They are categorical variables, but read_csv reads them from the CSV as character columns.
airbnb_data <- airbnb %>%
select(-id, -name, -host_id, -host_name, -last_review, -neighbourhood) %>%
mutate(room_type = factor(room_type), neighbourhood_group = factor(neighbourhood_group))
airbnb_data
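The earlier observation that last_review and reviews_per_month are missing on the same rows can be checked directly rather than inferred from the matching NA counts. The sketch below uses a small made-up data frame (toy) in place of the real table; on the real data the same two checks would have to run on airbnb, since airbnb_data no longer contains last_review.

```r
# Toy stand-in for the listings table (values are illustrative only).
toy <- data.frame(
  number_of_reviews = c(9, 45, 0, 270, 0),
  last_review       = as.Date(c("2018-10-19", "2019-05-21", NA, "2019-07-05", NA)),
  reviews_per_month = c(0.21, 0.38, NA, 4.64, NA)
)

# TRUE when both columns are missing on exactly the same rows.
same_missing_rows <- identical(is.na(toy$last_review), is.na(toy$reviews_per_month))

# TRUE when every row with a missing review date has zero reviews.
missing_means_no_reviews <- all(toy$number_of_reviews[is.na(toy$last_review)] == 0)

same_missing_rows
missing_means_no_reviews
```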
— The price variable is hard to work with directly because it depends on minimum_nights, so we derive a price per night by dividing price by minimum_nights
airbnb_clean <- airbnb_data %>% mutate(price_per_night = price/minimum_nights)
airbnb_clean
Exercise 1: Create an appropriate plot to visualize the distribution of this variable.
Creating a histogram of the price_per_night variable, as it is a continuous numeric variable
ggplot(airbnb_clean, aes(x = price_per_night)) +
geom_histogram(fill = 'skyblue', colour = 'black') + ggtitle("Distribution of Price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
— Finding the minimum and maximum values of price_per_night to get the price range
Minimum price_per_night
print(min(airbnb_clean$price_per_night))
## [1] 0
Maximum price_per_night
print(max(airbnb_clean$price_per_night))
## [1] 8000
Range of price_per_night
price_per_night_range = max(airbnb_clean$price_per_night) - min(airbnb_clean$price_per_night)
price_per_night_range
## [1] 8000
Dividing the price_per_night range by 30 to approximate the bin width that geom_histogram uses by default (30 bins)
default_bin = price_per_night_range/30
default_bin
## [1] 266.6667
Creating a histogram of the price_per_night variable with an explicit binwidth
ggplot(airbnb_clean, aes(price_per_night)) +
geom_histogram(fill = 'skyblue', colour = 'black', binwidth = 267)
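Dividing the range by 30 is one way to pick a bin width. A commonly used alternative that is less sensitive to a single extreme value is the Freedman-Diaconis rule, 2 * IQR / n^(1/3). The sketch below compares the two on synthetic right-skewed data; the vector prices and both width names are assumptions for illustration and not values from the Airbnb file.

```r
set.seed(1)

# Synthetic right-skewed prices standing in for price_per_night.
prices <- rlnorm(1000, meanlog = 4, sdlog = 1)

# Width implied by splitting the full range into 30 equal bins.
width_30_bins <- diff(range(prices)) / 30

# Freedman-Diaconis rule: 2 * IQR / n^(1/3).
width_fd <- 2 * IQR(prices) / length(prices)^(1/3)

c(thirty_bins = width_30_bins, freedman_diaconis = width_fd)
```

Either value could be passed to the binwidth argument of geom_histogram. The range-based width grows with the largest observation, whereas the IQR-based width depends only on the middle half of the data.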
Exercise 2: Consider any outliers present in the data. If present, specify the criteria used to identify them and provide a logical explanation for how you handled them.
summary(airbnb_clean$price_per_night)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 20.00 44.50 70.17 81.50 8000.00
quantile(airbnb_clean$price_per_night,seq(0,1,by=0.1))
## 0% 10% 20% 30% 40% 50%
## 0.000000 6.166667 15.000000 24.500000 33.333333 44.500000
## 60% 70% 80% 90% 100%
## 55.500000 72.500000 96.666667 140.000000 8000.000000
— According to the dataset, the average price is about 70 dollars per night in New York City. For comparison, a web search for the average price of New York City hotel rooms returns the following:
— Search results for moderate New York hotel rooms: “New York City hotel room rates start at under $300 a night. This is the average price of a New York City hotel that is well located and full-service.”
— Link for the above statement: https://www.google.com/search?client=firefox-b-d&q=average+room+price+in+new+york+city+per+night
quantile(airbnb_clean$price_per_night,seq(0.0,1,by=0.01))
## 0% 1% 2% 3% 4% 5%
## 0.000000 1.233333 1.637702 2.142857 2.741649 3.333333
## 6% 7% 8% 9% 10% 11%
## 3.928571 4.433333 5.000000 5.628833 6.166667 6.666667
## 12% 13% 14% 15% 16% 17%
## 7.428571 8.100000 8.972000 10.000000 10.714286 11.666667
## 18% 19% 20% 21% 22% 23%
## 12.800000 13.800000 15.000000 16.000000 16.666667 17.800000
## 24% 25% 26% 27% 28% 29%
## 18.750000 20.000000 20.000000 21.500000 22.500000 23.333333
## 30% 31% 32% 33% 34% 35%
## 24.500000 25.000000 25.600000 26.666667 28.000000 29.000000
## 36% 37% 38% 39% 40% 41%
## 30.000000 30.000000 32.000000 32.500000 33.333333 34.500000
## 42% 43% 44% 45% 46% 47%
## 35.000000 36.666667 37.500000 38.500000 40.000000 40.000000
## 48% 49% 50% 51% 52% 53%
## 41.666667 42.500000 44.500000 45.000000 46.666667 48.333333
## 54% 55% 56% 57% 58% 59%
## 49.500000 50.000000 50.000000 50.000000 52.500000 55.000000
## 60% 61% 62% 63% 64% 65%
## 55.500000 58.000000 60.000000 60.000000 61.666667 63.000000
## 66% 67% 68% 69% 70% 71%
## 65.000000 66.630667 68.500000 70.000000 72.500000 75.000000
## 72% 73% 74% 75% 76% 77%
## 75.000000 76.500000 80.000000 81.500000 85.000000 87.500000
## 78% 79% 80% 81% 82% 83%
## 90.000000 92.500000 96.666667 99.500000 100.000000 100.000000
## 84% 85% 86% 87% 88% 89%
## 105.000000 111.000000 116.666667 122.500000 125.000000 131.666667
## 90% 91% 92% 93% 94% 95%
## 140.000000 150.000000 150.000000 169.000000 180.000000 200.000000
## 96% 97% 98% 99% 100%
## 220.000000 250.000000 300.000000 443.000000 8000.000000
— We can see that 99% of the data lies at or below 443 dollars per night, as the detailed output of the quantile function above shows. The remaining 1% of observations, between the 99th and 100th percentiles (above 443 dollars), are therefore treated as outliers. Listings with a price of 0 dollars per night are also treated as outliers, since they are effectively free properties.
To remove the outliers:
# Drop the top 1% (above the 99th percentile of 443 dollars per night)
x <- airbnb_clean %>% select(price_per_night) %>% filter(price_per_night <= 443)
# Drop free listings (0 dollars per night)
y <- x %>% filter(price_per_night > 0)
y
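The cutoff used above is typed in by hand from the quantile output. A small sketch of the same idea with the cutoff computed from the data is shown here, using made-up values in toy_prices; the names upper_cutoff and kept are hypothetical. Note that on ten toy values the 99th percentile is an interpolated number, so it will not equal 443.

```r
# Toy prices per night (illustrative values, not taken from the dataset).
toy_prices <- c(0, 12, 25, 40, 44.5, 60, 81.5, 140, 443, 8000)

# Upper cutoff taken from the data instead of typed by hand.
upper_cutoff <- quantile(toy_prices, probs = 0.99, names = FALSE)

# Keep listings that are not free and not above the cutoff.
kept <- toy_prices[toy_prices > 0 & toy_prices <= upper_cutoff]

kept
```

Computing the threshold this way keeps the filter consistent with the stated criterion if the data are reloaded or updated.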
Exercise 3: Describe the shape and skewness of the distribution.
— Finding the shape and skewness of price_per_night
Mean
avg_price_per_night = mean(airbnb_clean$price_per_night)
avg_price_per_night
## [1] 70.17425
ggplot(airbnb_clean, aes(x = price_per_night)) +
geom_histogram(fill = 'skyblue', colour = 'black', binwidth = 267)+
geom_vline(xintercept = avg_price_per_night, linetype = "dashed", colour = 'red', size = 1) +
ggtitle("Count of Price Per Night, Showing Its Mean Value")
med_price_per_night = median(airbnb_clean$price_per_night)
med_price_per_night
## [1] 44.5
ggplot(airbnb_clean, aes(x = price_per_night)) +
geom_histogram(fill = 'skyblue', colour = 'black', binwidth = 267) +
geom_vline(xintercept = med_price_per_night, linetype = "dashed", colour = 'blue', size = 1) +
ggtitle("Count of Price Per Night, Showing Its Median Value")
ggplot(airbnb_clean, aes(x = price_per_night)) +
geom_histogram(fill = 'skyblue',colour = 'black', binwidth = 267) +
geom_vline(xintercept = avg_price_per_night, linetype = "dashed", colour = 'red', size = 1) +
geom_vline(xintercept = med_price_per_night, linetype = "dashed", colour = 'blue', size = 1) +
ggtitle("Count of Price Per Night, Comparing Its Mean and Median Values")
— The plot above shows that the mean (the average price_per_night) is greater than the median, which is what we expect for a right-skewed distribution. The shape is unimodal.
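Comparing the mean and the median gives the direction of the skew. A numeric measure can complement it. The sketch below computes the moment-based sample skewness by hand on made-up values; the helper sample_skewness and the vector toy_prices are illustrative assumptions, not part of the original analysis.

```r
# Moment-based sample skewness: third central moment divided by the
# cubed (population-style) standard deviation.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^(3 / 2)
}

# Illustrative right-skewed values: a few large prices stretch the right tail.
toy_prices <- c(20, 30, 40, 45, 50, 60, 80, 150, 400)

sample_skewness(toy_prices)            # positive for a right-skewed sample
mean(toy_prices) > median(toy_prices)  # mean is pulled above the median
```

A value near zero indicates a roughly symmetric sample, and a positive value indicates a longer right tail.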
Exercise 4: Based on your answer to the previous question, decide if it is appropriate to apply a transformation to your data. If no, explain why not. If yes, name the transformation applied and visualize the transformed distribution.
p1 <- ggplot(airbnb_clean, aes(x = price_per_night)) +
geom_histogram(fill = 'skyblue', colour = 'black', binwidth = 267)
p2 <- ggplot(airbnb_clean, aes(x = price_per_night)) +
geom_histogram(fill = 'skyblue', colour = 'black') + scale_x_log10() + xlab("Log10 of Price")
p3 <- ggplot(airbnb_clean, aes(x = price_per_night)) +
geom_histogram(fill = 'skyblue', colour = 'black') + scale_x_sqrt() + xlab("Square Root of Price")
grid.arrange(p1, p2, p3)
— Yes. A log10 transformation is appropriate because the data are highly right-skewed. The transformed distribution is shown in plot p2 above.
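One detail to keep in mind with the log10 transformation: the minimum price_per_night reported earlier is 0, and log10(0) is -Inf, so those rows cannot be placed on a log axis (ggplot2 removes non-finite values with a warning). The sketch below shows two common workarounds on made-up values; toy_prices and positive_prices are hypothetical names.

```r
# Illustrative values including a free (0-dollar) listing.
toy_prices <- c(0, 10, 100, 1000)

log10(toy_prices)        # the zero becomes -Inf
log10(toy_prices + 1)    # shifted transform keeps every row finite

# Alternative: drop non-positive values before transforming.
positive_prices <- toy_prices[toy_prices > 0]
log10(positive_prices)
```

Dropping the zeros is consistent with treating 0-dollar listings as outliers, while the shifted transform keeps them at the cost of a slightly different scale.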
Exercise 5: Choose and calculate an appropriate measure of central tendency.
The mean value is:
avg_price_per_night = mean(airbnb_clean$price_per_night)
avg_price_per_night
## [1] 70.17425
The median value is:
med_price_per_night = median(airbnb_clean$price_per_night)
med_price_per_night
## [1] 44.5
Relative difference between the mean and the median (as a fraction of the median):
(avg_price_per_night - med_price_per_night) / med_price_per_night
## [1] 0.5769494
— The mean is greater than the median (by about 58%), which confirms the skewness noted in the histogram.
— So we will take the median as the measure of central tendency.
Exercise 6: Explain why you chose this as your measure of central tendency. Provide supporting evidence for your choice.
— The mean is greater than the median, which confirms the skewness noted in the histogram.
— It would be better to use the median as a measure of central tendency given that the distribution is skewed. We know that the median is more robust to outliers than the mean so this should give us a better sense of the centre of the data distribution.
Exercise 7: Choose and calculate a measure of spread that is appropriate for your chosen measure of central tendency. Explain why you chose this as your measure of spread.
We should use the interquartile range as the measure of spread, since it is more robust than the standard deviation in the presence of skewness and outliers.
— The interquartile range is:
IQR(airbnb_clean$price_per_night)
## [1] 61.5
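The interquartile range is the third quartile minus the first quartile. For the summary printed earlier that is 81.5 - 20 = 61.5, matching the output above. The sketch below repeats the calculation by hand on made-up values and contrasts it with the standard deviation; toy_prices and iqr_by_hand are illustrative names.

```r
# Illustrative right-skewed values.
toy_prices <- c(20, 30, 40, 45, 50, 60, 80, 150, 400)

q <- quantile(toy_prices, probs = c(0.25, 0.75), names = FALSE)
iqr_by_hand <- q[2] - q[1]   # third quartile minus first quartile

iqr_by_hand
IQR(toy_prices)              # same value from the built-in function
sd(toy_prices)               # inflated by the two large values
```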
Exercise 1: Create an appropriate plot to visualize the distribution of counts for this variable.
— Counting neighbourhood_group manually
airbnb_clean %>% count(neighbourhood_group)
ggplot(airbnb, aes(x = neighbourhood_group)) +
geom_bar(stat = "count", fill = 'orange') +
geom_text(aes(label = ..count..), stat = "count", position = position_stack()) +
ggtitle("The count of properties listed in neighbourhood_group")
Exercise 2: Create an appropriate plot to visualize the distribution of proportions for this variable.
— Visualizing neighbourhood_group variable’s distribution of proportions.
ggplot(airbnb_clean, aes(x = neighbourhood_group, y = ..prop.., group = 1)) +
geom_bar(fill = "orange", color = "black", stat = "count") + ggtitle("Distribution of neighbourhood_group according to their Proportions")
— Manually calculating these proportions and verifying that the results match the previous bar plot.
airbnb_clean %>% group_by(neighbourhood_group) %>%
summarise(n = n()) %>%
mutate(prop = n / sum(n))
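The same proportions can be cross-checked in base R with table() and prop.table(), which is a quick way to confirm that they sum to 1. The labels and counts in toy_groups below are made up for the example and do not reflect the real borough counts.

```r
# Illustrative borough labels (counts are made up for the example).
toy_groups <- factor(c("Manhattan", "Manhattan", "Brooklyn", "Brooklyn",
                       "Brooklyn", "Queens", "Bronx", "Staten Island"))

counts <- table(toy_groups)
props  <- prop.table(counts)   # each count divided by the total

props
sum(props)                     # proportions should add up to 1
```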
Exercise 3: Discuss any unusual observations for this variable?
— Manhattan has the largest number of properties listed on Airbnb, while Staten Island has the smallest.
Exercise 4: Discuss if there are too few/too many unique values?
Descending
airbnb_clean %>% group_by(neighbourhood_group) %>%
summarise(n = n()) %>%
mutate(prop = n / sum(n)) %>%
arrange(desc(n))
Ascending
airbnb_clean %>% group_by(neighbourhood_group) %>%
summarise(n = n()) %>%
mutate(prop = n / sum(n)) %>%
arrange(n)
— The values above show that Staten Island has the fewest properties listed on Airbnb, while Manhattan has the most.
Exercise 1: Create an appropriate plot to visualize the relationship between the two variables.
ggplot(airbnb_clean, aes(x = number_of_reviews, y = price)) +
geom_point(aes(size = price), alpha = 0.05, color = "slateblue") +
xlab("Number of reviews") +
ylab("Price") +
ggtitle("Relationship between number of reviews and price",
subtitle = "The most expensive listings have few reviews (or none)")
Exercise 2: Describe the form, direction, and strength of the observed relationship. Include both qualitative and quantitative measures, as appropriate.
— Let’s reproduce the above plot but with a best fit line:
ggplot(airbnb_clean, aes(x = number_of_reviews, y = price)) +
geom_point(aes(size = price), alpha = 0.05, color = "slateblue") +
geom_smooth(method = 'lm', se=FALSE, color = "black") +
xlab("Number of reviews") +
ylab("Price") +
ggtitle("Relationship between number of reviews and price",
subtitle = "The most expensive listings have few reviews (or none)")
— Price appears to have a negative, linear relationship with the number of reviews (or at least could potentially be modeled as such), but the relationship appears to be weak.
— The correlation coefficient is close to zero, confirming the weak relationship we see in the plot.
cor(airbnb_clean$price, airbnb_clean$number_of_reviews)
## [1] -0.04795423
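The coefficient above is Pearson's, which measures linear association and can be pulled around by a few extreme prices. A rank-based alternative is Spearman's coefficient, available through the method argument of cor(). The sketch below uses made-up vectors (reviews, price) to show how the two can differ; it does not report a value for the real data.

```r
# Illustrative data: price falls as reviews rise, plus one extreme price.
reviews <- c(0, 1, 5, 20, 50, 120, 300)
price   <- c(10000, 250, 180, 150, 120, 90, 60)

cor(reviews, price)                       # Pearson, dominated by the extreme price
cor(reviews, price, method = "spearman")  # rank-based, -1 for a strictly decreasing pattern
```

On the real data the call would be cor(airbnb_clean$price, airbnb_clean$number_of_reviews, method = "spearman").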
Exercise 3: Explain what this relationship means in the context of the data.
— Although it is very weak, the relationship suggests that, on average, as price decreases we can expect the number of reviews to increase.
— The variation in price clearly decreases as the number of reviews increases.
— The data include all room types with different minimum nights, which helps explain the wide variation in price across the number of reviews. Some properties have a very low minimum number of nights, so for them price and number of reviews are more strongly connected than for others.
Exercise 4: Describe the variability that you observe in the plot and how that corresponds to the strength you calculated in #2 above.
— Because strength is a quantitative measure, we use the correlation coefficient of the two numeric variables to understand the variability.
— The correlation coefficient is negative and nowhere near 1 in absolute value (about -0.05), so the relationship between price and number of reviews is very weak. The fitted line in the plot shows the same thing.
— The most expensive listings have few reviews (or none).
— The variability of the data is not consistent across the number of reviews.
Exercise 1: Create an appropriate plot to visualize the relationship between the two variables.
ggplot(airbnb_clean, aes(x = room_type, y = price)) +
geom_boxplot(alpha = 0.5) +
labs(x = 'Room Type', y = 'Price') +
scale_y_log10()
Exercise 2: Describe the form, direction, and strength of the observed relationship. Include both qualitative and quantitative measures, as appropriate.
— Median price appears to be negatively related to room type, decreasing from Entire home/apt to Private room to Shared room.
— The relationship appears to be roughly linear across the room types.
— The strength of the observed relationship is assessed qualitatively, and the variability is fairly consistent across the room types.
Exercise 3: Explain what this relationship means in the context of the data.
— In the context of the data, the relationship shows that entire homes/apartments are the most expensive and shared rooms are the least expensive, with private rooms priced in between.
Exercise 4: Describe the variability that you observe in the plot and how that corresponds to the strength you calculated in #2 above.
— The variability of the data seems fairly consistent across the room types.
— The price distribution within each room type looks approximately normal after applying the log10 scale, which is needed because the price range is very wide.
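The boxplot comparison above is qualitative. One way to put numbers on it is to compute the median and interquartile range of price within each room type. The sketch below does this in base R with tapply() on a made-up data frame (toy), so the printed values are illustrative only.

```r
# Illustrative prices by room type (made-up values).
toy <- data.frame(
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Private room", "Private room", "Private room",
                "Shared room", "Shared room", "Shared room"),
  price     = c(150, 200, 400, 60, 80, 100, 30, 45, 60)
)

# Median and interquartile range of price within each room type.
median_by_type <- tapply(toy$price, toy$room_type, median)
iqr_by_type    <- tapply(toy$price, toy$room_type, IQR)

median_by_type
iqr_by_type
```

On the real data, the equivalent call would use airbnb_clean$price and airbnb_clean$room_type.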
References: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
California State University Course Material
http://insideairbnb.com/new-york-city/
I, Kushal Patel, hereby state that we have not communicated with or gained information in any way from any person or resource that would violate the College’s academic integrity policies, and that all work presented is our own. In addition, we also agree not to share our work in any way, before or after submission, that would violate the College’s academic integrity policies.